(54) Information storage and retrieval

(57) An information retrieval system in which a set
of distinct information items map to respective nodes in
an array of nodes of a self-organising map by mutual
similarity of the information items, so that similar information
items map to nodes at similar positions in the array of
nodes, comprises a data network; an information
retrieval client system connected to the data network; and
one or more information item storage nodes connected
to the data network; in which: each storage node
comprises means for storing a plurality of information items
and indexing means for transmitting data derived from
information items stored at that storage node to the client
system via the data network; and the client system
comprises means, responsive to data received from the
indexing means of a storage node, for generating a node
position in respect of each information item represented
by the received data.
Description

[0001] This invention relates to information storage 
and retrieval. 

[0002] There are many established systems for locat- 
ing information (e.g. documents, images, emails, pat- 
ents, internet content or media content such as audio/ 
video content) by searching under keywords. Examples 
include internet search "engines" such as those provid- 
ed by "Google" ™ or "Yahoo" ™ where a search carried 
out by keyword leads to a list of results which are ranked 
by the search engine in order of perceived relevance. 
[0003] However, in a system encompassing a large 
amount of content, often referred to as a massive con- 
tent collection, it can be difficult to formulate effective 
search queries to give a relatively short list of search 
"hits". For example, at the time of preparing the present 
application, a Google search on the keywords "massive 
document collection" drew 243000 hits. This number of 
hits would be expected to grow if the search were re- 
peated later, as the amount of content stored across the 
internet generally increases with time. Reviewing such 
a list of hits can be prohibitively time-consuming. 
[0004] In general, some reasons why massive con- 
tent collections are not well utilised are: 

• a user does not know that relevant content exists

• a user knows that relevant content exists but does 
not know where it can be located 

• a user knows that content exists but does not know 
it is relevant 

• a user knows that relevant content exists and how 
to find it, but finding the content takes a long time 

[0005] The paper "Self Organisation of a Massive
Document Collection", Kohonen et al, IEEE Transactions
on Neural Networks, Vol 11, No. 3, May 2000, pages
574-585 discloses a technique using so-called
"self-organising maps" (SOMs). These make use of so-called
unsupervised self learning neural network algorithms in 
which "feature vectors" representing properties of each 
document are mapped onto nodes of a SOM. 
[0006] In the Kohonen et al paper, a first step is to
pre-process the document text, and then a feature vector is
derived from each pre-processed document. In one 
form, this may be a histogram showing the frequencies 
of occurrence of each of a large dictionary of words. 
Each data value (i.e. each frequency of occurrence of a 
respective dictionary word) in the histogram becomes a 
value in an n-value vector, where n is the total number 
of candidate words in the dictionary (43222 in the ex- 
ample described in this paper). Weighting may be ap- 
plied to the n vector values, perhaps to stress the in- 
creased relevance or improved differentiation of certain 
words. 

[0007] The n-value vectors are then mapped onto
smaller dimensional vectors (i.e. vectors having a
number of values m (500 in the example in the paper)
which is substantially less than n). This is achieved by
multiplying the vector by an (n x m) "projection matrix"
formed of an array of random numbers. This technique
has been shown to generate vectors of smaller dimension
where any two reduced-dimension vectors have
much the same vector dot product as the two respective
input vectors. This vector mapping process is described
in the paper "Dimensionality Reduction by Random
Mapping: Fast Similarity Computation for Clustering",
Kaski, Proc IJCNN, pages 413-418, 1998.

[0008] The reduced dimension vectors are then
mapped onto nodes (otherwise called neurons) on the
SOM by a process of multiplying each vector by a "model"
(another vector). The models are produced by a
learning process which automatically orders them by
mutual similarity onto the SOM, which is generally
represented as a two-dimensional grid of nodes. This is a
non-trivial process which took Kohonen et al six weeks
on a six-processor computer having 800 MB of memory,
for a document database of just under seven million
documents. Finally the grid of nodes forming the SOM is
displayed, with the user being able to zoom into regions
of the map and select a node, which causes the user
interface to offer a link to an internet page containing the
document linked to that node.

[0009] This invention provides an information retrieval
system in which a set of distinct information items map
to respective nodes in an array of nodes by mutual
similarity of the information items, so that similar information
items map to nodes at similar positions in the array of
nodes; the system comprising:

a data network;

an information retrieval client system connected to
the data network; and

one or more (though preferably two or more)
information item storage nodes connected to the data
network;

in which:

each storage node comprises means for storing a
plurality of information items and indexing means
for transmitting data derived from information items
stored at that storage node to the client system via
the data network; and

the client system comprises means, responsive to
data received from the indexing means of a storage
node, for generating a node position in respect of
each information item represented by the received
data.

[0010] The invention provides an efficient and
convenient way of operating an information retrieval system
over a network such as the internet.

[0011] Further respective aspects and features of the
invention are defined in the appended claims.

[0012] The skilled man will realise that in the present
specification, within the normal usage of the word "list",
the "data representing information items" could be the 
item itself, if it is of a size and nature appropriate for full 
display, or could be data indicative of the item. 
[0013] Further respective aspects and features of the
invention are defined in the appended claims. 
[0014] Embodiments of the invention will now be de- 
scribed, by way of example only, with reference to the 
accompanying drawings in which: 

Figure 1 schematically illustrates an information 
storage and retrieval system; 
Figure 2 is a schematic flow chart showing the gen- 
eration of a self-organising map (SOM); 
Figures 3a and 3b schematically illustrate term fre- 
quency histograms; 

Figure 4a schematically illustrates a raw feature 
vector;

Figure 4b schematically illustrates a reduced fea- 
ture vector; 

Figure 5 schematically illustrates an SOM; 
Figure 6 schematically illustrates a dither process; 
Figures 7 to 9 schematically illustrate display 
screens providing a user interface to access infor- 
mation represented by the SOM; 
Figure 10 schematically illustrates a camcorder as 
an example of a video acquisition and/or processing 
apparatus; 

Figure 11 schematically illustrates a personal digital
assistant as an example of portable data process- 
ing apparatus; and 

Figure 12 schematically illustrates a networked in- 
formation storage and retrieval system. 

[0015] Figure 1 is a schematic diagram of an information
storage and retrieval system based around a
general-purpose computer 10 having a processor unit 20
including disk storage 30 for programs and data, a
network interface card 40 connected to a network 50 such
as an Ethernet network or the Internet, a display device
such as a cathode ray tube device 60, a keyboard 70
and a user input device such as a mouse 80. The system
operates under program control, the programs being
stored on the disk storage 30 and provided, for example,
by the network 50, a removable disk (not shown) or a
pre-installation on the disk storage 30.
[0016] The storage system operates in two general 
modes of operation. In a first mode, a set of information 
items (e.g. textual information items) is assembled on 
the disk storage 30 or on a network disk drive connected 
via the network 50 and is sorted and indexed ready for
a searching operation. The second mode of operation 
is the actual searching against the indexed and sorted 
data. 

[0017] The embodiments are applicable to many 
types of information items. A non-exhaustive list of ap- 
propriate types of information includes patents, video 
material, emails, presentations, internet content,
broadcast content, business reports, audio material, graphics
and clipart, photographs and the like, or combinations
or mixtures of any of these. In the present description,
reference will be made to textual information items, or
at least information items having a textual content or
association. So, for example, a piece of broadcast content
such as audio and/or video material may have
associated "MetaData" defining that material in textual terms.
[0018] The information items are loaded onto the disk
storage 30 in a conventional manner. Preferably, they
are stored as part of a database structure which allows
for easier retrieval and indexing of the items, but this is
not essential. Once the information items have been
so stored, the process used to arrange them for
searching is shown schematically in Figure 2.

[0019] It will be appreciated that the indexed information
data need not be stored on the local disk drive 30.
The data could be stored on a remote drive connected
to the system 10 via the network 50. Alternatively, the
information may be stored in a distributed manner, for
example at various sites across the internet. If the
information is stored at different internet or network sites, a
second level of information storage could be used to
store locally a "link" (e.g. a URL) to the remote
information, perhaps with an associated summary, abstract or
MetaData associated with that link. So, the remotely
held information would not be accessed unless the user
selected the relevant link (e.g. from the results list 260
to be described below), although for the purposes of the
technical description which follows, the remotely held
information, or the abstract/summary/MetaData, or the
link/URL could be considered as the "information item".
[0020] In other words, a formal definition of the
"information item" is an item from which a feature vector is
derived and processed (see below) to provide a mapping
to the SOM. The data shown in the results list 260
(see below) may be the information item itself (if it is held
locally and is short enough for convenient display) or
may be data representing and/or pointing to the
information item, such as one or more of MetaData, a URL,
an abstract, a set of key words, a representative key
stamp image or the like. This is inherent in the operation
"list" which often, though not always, involves listing
data representing a set of items.

[0021] In a further example, the information items
could be stored across a networked work group, such
as a research team or a legal firm. A hybrid approach
might involve some information items stored locally and/
or some information items stored across a local area
network and/or some information items stored across a
wide area network. In this case, the system could be
useful in locating similar work by others. For example, in
a large multi-national research and development
organisation, similar research work would tend to be mapped
to similar output nodes in the SOM (see below). Or, if a
new television programme is being planned, the present
technique could be used to check for its originality by
detecting previous programmes having similar content.
[0022] It will also be appreciated that the system 10
of Figure 1 is but one example of possible systems
which could use the indexed information items.
Although it is envisaged that the initial (indexing) phase
would be carried out by a reasonably powerful computer,
most likely by a non-portable computer, the later
phase of accessing the information could be carried out 
at a portable machine such as a "personal digital assist- 
ant" (a term for a data processing device with display 
and user input devices, which generally fits in one hand), 
a portable computer such as a laptop computer, or even 
devices such as a mobile telephone, a video editing ap- 
paratus or a video camera. In general, practically any 
device having a display could be used for the informa- 
tion-accessing phase of operation. 
[0023] The processes are not limited to particular 
numbers of information items.

[0024] The process of generating a self-organising 
map (SOM) representation of the information items will 
now be described with reference to Figures 2 to 6. Fig- 
ure 2 is a schematic flow chart illustrating a so-called 
"feature extraction" process followed by an SOM map- 
ping process. 

[0025] Feature extraction is the process of transform- 
ing raw data into an abstract representation. These ab- 
stract representations can then be used for processes 
such as pattern classification, clustering and recogni- 
tion. In this process, a so-called "feature vector" is gen- 
erated, which is an abstract representation of the fre- 
quency of terms used within a document. 
[0026] The process of forming the visualisation 
through creating feature vectors includes: 

• Create "document database dictionary" of terms 

• Create "term frequency histograms" for each
individual document based on the "document database
dictionary"

• Reduce the dimension of the "term frequency his- 
togram" using random mapping 

• Create a 2-dimensional visualisation of the informa- 
tion space. 

[0027] Considering these steps in more detail, each 
document (information item) 100 is opened in turn. At a 
step 110, all "stop words" are removed from the
document. Stop-words are extremely common words on a
pre-prepared list, such as "a", "the", "however", "about"
and "and". Because these words are extremely
common they are likely, on average, to appear with sim- 
ilar frequency in all documents of a sufficient length. For 
this reason they serve little purpose in trying to charac- 
terise the content of a particular document and should 
therefore be removed. 

[0028] After removing stop-words, the remaining 
words are stemmed at a step 120, which involves finding
the common stem of a word's variants. For example the 
words "thrower", "throws", and "throwing" have the com- 
mon stem of "throw". 
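
The stop-word removal (step 110) and stemming (step 120) described above might be sketched as follows. This is only an illustrative sketch: the stop-word list is the short example list from the text, and `naive_stem` is a hypothetical suffix-stripping helper standing in for whatever stemming algorithm an implementation actually uses.

```python
import re

# Example stop-word list taken from the text; a real list would be far longer.
STOP_WORDS = {"a", "the", "however", "about", "and"}

# Hypothetical suffix list for illustration only; not a real stemming algorithm.
SUFFIXES = ("ing", "er", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(words):
    """Step 110: drop every word that appears on the stop-word list."""
    return [w for w in words if w not in STOP_WORDS]


def naive_stem(word):
    """Step 120 (simplified): strip the first matching suffix,
    keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply stop-word removal followed by stemming."""
    return [naive_stem(w) for w in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The thrower throws and keeps throwing"))
```

With this simplified rule the variants "thrower", "throws" and "throwing" all reduce to "throw", as in the example above.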



[0029] A "dictionary" of stemmed words appearing in 
the documents (excluding the "stop" words) is main- 
tained. As a word is newly encountered, it is added to 
the dictionary, and a running count of the number of
times the word has appeared in the whole document
collection (set of information items) is also recorded.
[0030] The result is a list of terms used in all the
documents in the set, along with the frequency with which
those terms occur. Words that occur with too high or too
low a frequency are discounted, which is to say that they
are removed from the dictionary and do not take part in
the analysis which follows. Words with too low a
frequency may be misspellings, made up, or not relevant
to the domain represented by the document set. Words
that occur with too high a frequency are less appropriate
for distinguishing documents within the set. For example,
the term "News" is used in about one third of all
documents in a test set of broadcast-related documents,
whereas the word "football" is used in only about 2% of
documents in the test set. Therefore "football" can be
assumed to be a better term for characterising the
content of a document than "News". Conversely, the word
"fottball" (a misspelling of "football") appears only once
in the entire set of documents, and so is discarded for
having too low an occurrence. Words to be discounted may be
defined as those having a frequency of occurrence which
is more than two standard deviations below the mean
frequency of occurrence, or which is more than two
standard deviations above the mean frequency of
occurrence.
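
A minimal sketch of the two-standard-deviation rule just described is given below. The function name `prune_dictionary`, the parameter `k` and the use of the population standard deviation are assumptions made for illustration; the text does not specify them.

```python
from statistics import mean, pstdev


def prune_dictionary(counts, k=2.0):
    """Keep only terms whose collection-wide count lies within
    k standard deviations of the mean count (k = 2 in the text)."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    low, high = mu - k * sigma, mu + k * sigma
    return {term: c for term, c in counts.items() if low <= c <= high}


if __name__ == "__main__":
    # Twenty ordinary terms and one very common term.
    counts = {f"term{i}": 10 for i in range(20)}
    counts["news"] = 1000
    kept = prune_dictionary(counts)
    print("news" in kept, len(kept))
```

Note that when counts are heavily skewed the lower bound computed this way can fall below zero, in which case nothing is removed at the low end; this sketch does not attempt to address that case.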

[0031] A feature vector is then generated at a step 
130. 

[0032] To do this, a term frequency histogram is
generated for each document in the set. A term frequency
histogram is constructed by counting the number of
times words present in the dictionary (pertaining to that
document set) occur within an individual document. The
majority of the terms in the dictionary will not be present
in a single document, and so these terms will have a
frequency of zero. Schematic examples of term
frequency histograms for two different documents are shown in
Figures 3a and 3b.
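
The construction of a term frequency histogram might be sketched as follows, assuming the document has already been reduced to a list of stemmed terms and the dictionary is an ordered list of terms. The function names and the four-word dictionary are illustrative only.

```python
from collections import Counter


def term_frequency_histogram(doc_terms, dictionary):
    """Count how often each dictionary term occurs in one document.
    Terms absent from the document get a frequency of zero."""
    counts = Counter(doc_terms)
    return [counts[term] for term in dictionary]


def sparse_form(histogram):
    """Keep only the non-zero entries as {dictionary index: count}."""
    return {i: c for i, c in enumerate(histogram) if c}


if __name__ == "__main__":
    dictionary = ["mpeg", "video", "metadata", "football"]
    hist = term_frequency_histogram(["mpeg", "video", "mpeg"], dictionary)
    print(hist)
    print(sparse_form(hist))
```

Because most entries are zero, the sparse form is usually the more economical representation for a large dictionary.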

[0033] It can be seen from this example how the
histograms characterise the content of the documents. By
inspecting the examples it is seen that document 1 has
more occurrences of the terms "MPEG" and "Video"
than document 2, which itself has more occurrences of
the term "MetaData". Many of the entries in the
histogram are zero as the corresponding words are not
present in the document.

[0034] In a real example, the actual term frequency 
histograms have a very much larger number of terms in 
them than the example. Typically a histogram may plot 
the frequency of over 50000 different terms, giving the
histogram a dimension of over 50000. The dimension of
this histogram needs to be reduced considerably if it is 
to be of use in building an SOM information space. 
[0035] Each entry in the term frequency histogram is
used as a corresponding value in a feature vector
representing that document. The result of this process is a
(50000 x 1) vector containing the frequency of all terms
specified by the dictionary for each document in the
document collection. The vector may be referred to as
"sparse" since most of the values will typically be zero,
with most of the others typically being a very low number
such as 1.

[0036] The size of the feature vector, and so the di- 
mension of the term frequency histogram, is reduced at 
a step 140. Two methods are proposed for the process
of reducing the dimension of the histogram. 

i) Random Mapping - a technique by which the his- 
togram is multiplied by a matrix of random numbers. 
This is a computationally cheap process. 

ii) Latent Semantic Indexing - a technique whereby 
the dimension of the histogram is reduced by look- 
ing for groups of terms that have a high probability 
of occurring simultaneously in documents. These 
groups of words can then be reduced to a single 
parameter. This is a computationally expensive 
process. 

[0037] The method selected for reducing the dimen- 
sion of the term frequency histogram in the present em- 
bodiment is "random mapping", as explained in detail in 
the Kaski paper referred to above. Random mapping 
succeeds in reducing the dimension of the histogram by 
multiplying it by a matrix of random numbers. 
[0038] As mentioned above, the "raw" feature vector 
(shown schematically in Figure 4a) is typically a sparse 
vector with a size in the region of 50000 values. This 
can be reduced to a size of about 200 (see schematic 
Figure 4b) and still preserve the relative characteristics 
of the feature vector, that is to say, its relationship such 
as relative angle (vector dot product) with other similarly 
processed feature vectors. This works because al- 
though the number of orthogonal vectors of a particular 
dimension is limited, the number of nearly orthogonal 
vectors is very much larger. 

[0039] In fact as the dimension of the vector increases 
any given set of randomly generated vectors are nearly 
orthogonal to each other. This property means that the 
relative direction of vectors multiplied by this matrix of 
random numbers will be preserved. This can be dem- 
onstrated by showing the similarity of vectors before and 
after random mapping by looking at their dot product. 
[0040] It can be shown experimentally that reducing
sparse vectors from 50000 values to 200 values
preserves their relative similarities. This mapping
is not perfect, but suffices for the purposes of
characterising the content of a document in a compact way.
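
The random mapping step can be illustrated with a short sketch. It assumes Gaussian random entries scaled by 1/sqrt(m) and regenerates each row of the random matrix on demand from a seed, so that only rows for non-zero histogram entries are ever computed; these choices, and all function names, are assumptions for illustration rather than details taken from the Kaski paper or the present embodiment.

```python
import math
import random


def random_row(term_index, m, seed=0):
    """Row `term_index` of an implicit (n x m) random matrix,
    regenerated deterministically from the seed when needed."""
    rng = random.Random(seed * 1_000_003 + term_index)
    scale = 1.0 / math.sqrt(m)
    return [rng.gauss(0.0, scale) for _ in range(m)]


def random_map(sparse_vec, m=200, seed=0):
    """Multiply a sparse vector {index: value} by the random matrix."""
    reduced = [0.0] * m
    for term_index, value in sparse_vec.items():
        for j, r in enumerate(random_row(term_index, m, seed)):
            reduced[j] += value * r
    return reduced


def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sparse_cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(val * v.get(idx, 0.0) for idx, val in u.items())
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    u = {3: 2.0, 17: 1.0, 42: 1.0, 100: 3.0}
    v = {3: 1.0, 17: 1.0, 99: 2.0, 100: 2.0}
    print("before:", sparse_cosine(u, v))
    print("after: ", cosine(random_map(u), random_map(v)))
```

Comparing the two printed cosines gives a rough feel for how well the relative angle between vectors survives the reduction; the agreement is approximate, not exact.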
[0041] Once feature vectors have been generated for 
the document collection, thus defining the collection's 
information space, they are projected into a two-dimen- 
sional SOM at a step 150 to create a semantic map. The 
following section explains the process of mapping to 2-D
by clustering the feature vectors using a Kohonen
self-organising map. Reference is also made to Figure 5.
[0042] A Kohonen Self-Organising map is used to 
cluster and organise the feature vectors that have been
generated for each of the documents.

[0043] A self-organising map consists of input nodes
170 and output nodes 180 in a two-dimensional array
or grid of nodes illustrated as a two-dimensional plane
185. There are as many input nodes as there are values
in the feature vectors being used to train the map. Each
of the output nodes on the map is connected to the input
nodes by weighted connections 190 (one weight per
connection).

[0044] Initially each of these weights is set to a
random value, and then, through an iterative process, the
weights are "trained". The map is trained by presenting
each feature vector to the input nodes of the map. The
"closest" output node is calculated by computing the
Euclidean distance between the input vector and weights
of each of the output nodes.

[0045] The closest node is designated the "winner" 
and the weights of this node are trained by slightly 
changing the values of the weights so that they move 
"closer" to the input vector. In addition to the winning 
node, the nodes in the neighbourhood of the winning
node are also trained, and moved slightly closer to the 
input vector. 

[0046] It is this process of training not just the weights 
of a single node, but the weights of a region of nodes 
on the map, that allows the map, once trained, to
preserve much of the topology of the input space in the 2-D
map of nodes. 
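
One training iteration of the kind described above might look like the following sketch. The fixed learning rate, the square neighbourhood of fixed radius and the function names are simplifying assumptions; practical SOM training typically shrinks both the rate and the neighbourhood over time.

```python
import random


def make_som(width, height, dim, rng):
    """One weight vector per output node, initialised to random values."""
    return [[[rng.random() for _ in range(dim)] for _ in range(width)]
            for _ in range(height)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def find_winner(som, vec):
    """Return (x, y) of the output node whose weights are closest to vec."""
    best = None
    for y, row in enumerate(som):
        for x, weights in enumerate(row):
            d = squared_distance(weights, vec)
            if best is None or d < best[0]:
                best = (d, x, y)
    return best[1], best[2]


def train_step(som, vec, rate=0.1, radius=1):
    """Move the winner and its neighbours slightly closer to vec."""
    wx, wy = find_winner(som, vec)
    for y, row in enumerate(som):
        for x, weights in enumerate(row):
            if abs(x - wx) <= radius and abs(y - wy) <= radius:
                for i in range(len(weights)):
                    weights[i] += rate * (vec[i] - weights[i])
    return wx, wy


if __name__ == "__main__":
    som = make_som(4, 4, 3, random.Random(1))
    print(train_step(som, [0.2, 0.8, 0.5]))
```

Repeating `train_step` over all feature vectors for many passes is what gradually orders the map so that neighbouring nodes hold similar weight vectors.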

[0047] Once the map is trained, each of the
documents can be presented to the map to see which of the
output nodes is closest to the input feature vector for
that document. It is unlikely that the weights will be iden- 
tical to the feature vector, and the Euclidean distance 
between a feature vector and its nearest node on the 
map is known as its "quantisation error". 

[0048] Presenting the feature vector for each
document to the map to see where it lies yields an x, y map
position for each document. These x, y positions, when
put in a look up table along with a document ID, can be
used to visualise the relationship between documents.
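
Building the look up table of document positions, together with the quantisation error for each document, could be sketched as below. The trained map is assumed to be held as a nested list indexed `som[y][x]`; that layout and the function names are illustrative assumptions.

```python
import math


def nearest_node(som, vec):
    """Return ((x, y), quantisation_error) for the closest output node."""
    best_pos, best_dist = None, math.inf
    for y, row in enumerate(som):
        for x, weights in enumerate(row):
            d = math.dist(weights, vec)  # Euclidean distance
            if d < best_dist:
                best_pos, best_dist = (x, y), d
    return best_pos, best_dist


def build_lookup_table(som, doc_vectors):
    """Map each document ID to the x, y position of its nearest node."""
    return {doc_id: nearest_node(som, vec)[0]
            for doc_id, vec in doc_vectors.items()}


if __name__ == "__main__":
    # A tiny 2 x 2 map with two-value weight vectors, for illustration.
    som = [[[0.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [1.0, 1.0]]]
    docs = {"doc1": [0.9, 0.1], "doc2": [0.1, 0.8]}
    print(build_lookup_table(som, docs))
    print(nearest_node(som, docs["doc1"]))
```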

[0049] Finally, a dither component is added at a step
160, which will be described with reference to Figure 6 
below. 

[0050] A potential problem with the process described
above is that two identical, or substantially identical,
information items may be mapped to the same node in
the array of nodes of the SOM. This does not cause a
difficulty in the handling of the data, but does not help
with the visualisation of the data on a display screen (to
be described below). In particular, when the data is
visualised on a display screen, it has been recognised that
it would be useful for multiple very similar items to be
distinguishable over a single item at a particular node.
Therefore, a "dither" component is added to the node
position to which each information item is mapped. The
dither component is a random addition of up to ±½ of
the node separation. So, referring to Figure 6, an
information item for which the mapping process selects an
output node 200 has a dither component added so that
it in fact may be mapped to any node position within the
area 210 bounded by dotted lines on Figure 6.
[0051] So, the information items can be considered to 
map to positions on the plane of Figure 6 at node posi- 
tions other than the "output nodes" of the SOM process. 
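
The dither step might be sketched as follows. A uniform random offset applied independently to each axis is an assumption; the text specifies only that the addition is random and at most half the node separation.

```python
import random


def add_dither(node_x, node_y, rng, spacing=1.0):
    """Offset a node position by a random amount of up to
    half the node separation in each direction."""
    dx = rng.uniform(-0.5, 0.5) * spacing
    dy = rng.uniform(-0.5, 0.5) * spacing
    return node_x + dx, node_y + dy


if __name__ == "__main__":
    rng = random.Random(42)
    # Two identical items mapped to the same output node (3, 5)
    # are displayed at slightly different positions.
    print(add_dither(3, 5, rng))
    print(add_dither(3, 5, rng))
```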
[0052] An alternative approach might be to use a 
much higher density of "output nodes" in the SOM map- 
ping process described above. This would not provide 
any distinction between absolutely identical information 
items, but may allow almost, but not completely, identi- 
cal information items to map to different but closely 
spaced output nodes. 

[0053] Figure 7 schematically illustrates a display on 
the display screen 60 in which data sorted into an SOM 
is graphically illustrated for use in a searching operation. 
The display shows a search enquiry 250, a results list 
260 and an SOM display area 270. 
[0054] In operation, the user types a key word search 
enquiry into the enquiry area 250. The user then initiates 
the search, for example by pressing enter on the key- 
board 70 or by using the mouse 80 to select a screen 
"button" to start the search. The key words in the search 
enquiry box 250 are then compared with the information 
items in the database using a standard keyword search 
technique. This generates a list of results, each of which 
is shown as a respective entry 280 in the list view 260. 
Also, each result has a corresponding display point on 
the node display area 270. 

[0055] Because the sorting process used to generate 
the SOM representation tends to group mutually similar 
information items together in the SOM, the results for 
the search enquiry generally tend to fall in clusters such 
as a cluster 290. Here, it is noted that each point on the 
area 270 corresponds to the respective entry in the SOM 
associated with one of the results in the result list 260; 
and the positions at which the points are displayed with- 
in the area 270 correspond to the array positions of 
those nodes within the node array. 
[0056] Figure 8 schematically illustrates a technique 
for reducing the number of "hits" (results in the result 
list). The user makes use of the mouse 80 to draw a box
300 around a set of display points corresponding to 
nodes of interest. In the results list area 260, only those 
results corresponding to points within the box 300 are 
displayed. If these results turn out not to be of interest, 
the user may draw another box encompassing a differ- 
ent set of display points. 

[0057] It is noted that the results area 260 displays list
entries for those results for which display points are
displayed within the box 300 and which satisfied the search
criteria in the word search area 250. The box 300 may
encompass other display positions corresponding to
populated nodes in the node array, but if these did not
satisfy the search criteria they will not be displayed and
so will not form part of the subset of results shown in the
box 260.
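
The narrowing of results by a user-drawn box can be sketched as a simple filter. The tuple layout of `results` and `box`, and the function name, are hypothetical; only items that already satisfied the keyword search are passed in, so items inside the box that did not match are never shown.

```python
def results_in_box(results, box):
    """Return the IDs of search results whose display point lies in the box.

    results: list of (doc_id, x, y) for items that satisfied the search.
    box:     (x_min, y_min, x_max, y_max) in the same coordinates.
    """
    x_min, y_min, x_max, y_max = box
    return [doc_id for doc_id, x, y in results
            if x_min <= x <= x_max and y_min <= y <= y_max]


if __name__ == "__main__":
    results = [("a", 1.0, 1.0), ("b", 4.0, 4.0), ("c", 2.5, 3.0)]
    print(results_in_box(results, (0, 0, 3, 3)))
```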

[0058] Figure 9 schematically illustrates a technique
for detecting the node position of an entry in the list view
260. Using a standard technique in the field of graphical
user interfaces, particularly in computers using the
so-called "Windows" ™ operating system, the user may
"select" one or more of the entries in the results list view.
In the examples shown, this is done by a mouse click
on a "check box" 310 associated with the relevant
results. However, it could equally be done by clicking to
highlight the whole result, or by double-clicking on the
relevant result and so on. As a result is selected, the
corresponding display point representing the respective
node in the node array is displayed in a different manner.
This is shown schematically for two display points 320
corresponding to the selected results 330 in the results
area 260.

[0059] The change in appearance might be a display
of the point in a larger size, or in a more intense version 
of the same display colour, or in a different display col- 
our, or in a combination of these varying attributes. 
[0060] At any time, a new information item can be
added to the SOM by following the steps outlined above
(i.e. steps 110 to 140) and then applying the resulting
reduced feature vector to the "pre-trained" SOM models,
that is to say, the set of SOM models which resulted
from the self-organising preparation of the map. So, for
the newly added information item, the map is not
generally "retrained"; instead steps 150 and 160 are used
with all of the SOM models not being amended. To
retrain the SOM every time a new information item is to
be added is computationally expensive and is also
somewhat unfriendly to the user, who might grow used
to the relative positions of commonly accessed
information items in the map.

[0061] However, there may well come a point at which
a retraining process is appropriate. For example, if new
terms (perhaps new items of news, or a new technical
field) have entered into the dictionary since the SOM
was first generated, they may not map particularly well
to the existing set of output nodes. This can be detected
as an increase in a so-called "quantisation error"
detected during the mapping of a newly received information
item to the existing SOM. In the present embodiments,
the quantisation error is compared to a threshold error
amount. If it is greater than the threshold amount then
either (a) the SOM is automatically retrained, using all
of its original information items and any items added
since its creation; or (b) the user is prompted to initiate
a retraining process at a convenient time. The retraining
process uses the feature vectors of all of the relevant
information items and reapplies the steps 150 and 160
in full.
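
Adding a new item to an existing map, and deciding whether retraining should be suggested, might be sketched as follows. The function name, the return layout and the threshold value used in the example are assumptions; the text does not state how the threshold is chosen.

```python
import math


def place_new_item(som, reduced_vec, error_threshold):
    """Map a new item onto the existing SOM without changing any weights.

    Returns ((x, y), quantisation_error, retrain_suggested).
    """
    best_pos, best_err = None, math.inf
    for y, row in enumerate(som):
        for x, weights in enumerate(row):
            err = math.dist(weights, reduced_vec)
            if err < best_err:
                best_pos, best_err = (x, y), err
    # A large error suggests the item fits the existing map poorly.
    return best_pos, best_err, best_err > error_threshold


if __name__ == "__main__":
    som = [[[0.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [1.0, 1.0]]]
    print(place_new_item(som, [0.9, 0.1], error_threshold=0.5))
    print(place_new_item(som, [5.0, 5.0], error_threshold=0.5))
```

Whether a suggested retraining is run automatically or deferred until the user agrees corresponds to options (a) and (b) above.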

[0062] Figure 10 schematically illustrates a camcorder
500 as an example of a video acquisition and/or
processing apparatus, the camcorder including an
image capture device 510 with an associated lens 520; a
data/signal processor 530; tape storage 540; disk or oth- 
er random access storage 550; user controls 560; and 
a display device 570 with eyepiece 580. Other features 
of conventional camcorders or other alternatives (such 
as different storage media or different display screen ar- 
rangements) will be apparent to the skilled man. In use, 
MetaData relating to captured video material may be 
stored on the storage 550, and an SOM relating to the 
stored data viewed on the display device 570 and con- 
trolled as described above using the user controls 560. 
[0063] Figure 11 schematically illustrates a personal 
digital assistant (PDA) 600, as an example of portable 
data processing apparatus, having a display screen 610 
including a display area 620 and a touch sensitive area 
630 providing user controls; along with data processing 
and storage (not shown). Again, the skilled man will be 
aware of alternatives in this field. The PDA may be used
as described above in connection with the system of Fig- 
ure 1. 

[0064] Figure 12 schematically illustrates a net- 
worked information storage and retrieval apparatus. 
The system may operate under software control as de- 
scribed earlier. 

[0065] The functionality of the arrangement of Figure 
1 and the subsequent description is achieved in a net- 
worked system, with some additional features to en- 
hance the efficiency of use of the networked system. 
[0066] In general terms, the operation is divided be- 
tween a client system 800 and one or more storage 
nodes 810, the client system and the storage nodes be-
ing connected to one another by a networked connec-
tion such as an internet connection 820. In Figure 12 
schematic connections are shown between each stor- 
age node 810 and the client system. Many network ar-
rangements including the internet will notionally provide
a physical connection between all of the nodes connect- 
ed to that network, including between pairs of storage 
nodes 810. However the connections in Figure 12 are 
intended to represent logical data paths between the dif- 
ferent nodes. 

[0067] A search engine or internet search provider 
(server) 830, for example the known Google™ search
provider, may also be logically connected to the client 
system. 

[0068] The client system 800 comprises display / user 
interface logic 840 providing (or being connectable to) 
a user display operating as described above, content 
organisation service logic 850 and index service logic 
860. Each storage node comprises information storage 
(e.g. disk storage) 870, optional metadata extraction 
logic 880 and index agent logic 890. Apart from any in- 
formation held at the search engine 830, the information 
storage 870 of the storage nodes is the primary repos- 
itory of the information items in this embodiment. How- 
ever, it will be appreciated that this is just for the purpos- 
es of the present example; there is no technical reason 
why information items could not also be stored "locally",
i.e. at the client system.

[0069] The client system provides the following func- 
tionality described earlier: 

• optionally, the functionality of Figure 2 and subse-
quent description, i.e. the generation of an SOM (al-
though the SOM representation could have been
generated elsewhere)

• some or all of the functionality of Figures 7 to 9, i.e.
the display of the SOM representation and interface
with the user in handling the SOM representation

• at least part of the functionality of adding a newly
received information item to an "already trained"
SOM representation, optionally including the func-
tionality of initiating a retraining process. It is noted
that some steps, such as the steps 110 and 120,
may be carried out at the storage node rather than
at the client system.

[0070] In basic terms, the index agent at each storage
node derives data (e.g. by steps corresponding to the 
steps 110, 120) from textual matter either contained in 
an information item stored at that node or derived from 
such an information item by the metadata extraction log- 
ic 880 (e.g. in respect of information items consisting at 
least primarily of audio/video material). The resulting da- 
ta is then forwarded to the indexing service logic 860 of 
the client system. This can take place in one or more of 
several ways: 

• the index agent can forward a batch of data repre-
senting data derived from an information item as
that information item is detected to be newly stored 
or newly modified 

• the index agent can forward a batch of data repre-
senting data derived from all information items held
at that storage node, in response to a search query 
(or an information retrieval query operation) at the 
client system 

• the index agent can forward a batch of data repre- 
senting data derived from all information items held 
at that storage node, in response to a certain length 
of time having passed since it last did so 

• the index agent can maintain a register of those in-
formation items for which data has already been for-
warded to the client system, and those for which it
has not. In response to a search query (or an infor- 
mation retrieval query operation) at the client sys- 
tem, the index agent can forward some or all of the 
"not yet forwarded" data, as one or more batches 
of data. Information items for which data has been 
forwarded in this way are moved from the "not yet 
forwarded" list to the "forwarded" list at that storage 
node's index agent. 
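
As a non-limiting sketch, the register described in the last of the options above might be kept as two lists at the index agent. The class name, method names and the use of string identifiers are hypothetical. Returning a modified item to the "not yet forwarded" list is an assumption suggested by the first option rather than something stated for the register itself.

```python
class IndexAgent:
    """Tracks which stored items have had derived data sent to the client."""

    def __init__(self):
        self.items = {}              # identifier -> derived data
        self.not_yet_forwarded = []  # identifiers awaiting forwarding
        self.forwarded = []          # identifiers already forwarded

    def store(self, identifier, derived_data):
        """Record a newly stored or modified item as needing forwarding."""
        self.items[identifier] = derived_data
        # Assumption: a modified item must be forwarded again.
        if identifier in self.forwarded:
            self.forwarded.remove(identifier)
        if identifier not in self.not_yet_forwarded:
            self.not_yet_forwarded.append(identifier)

    def on_query(self, batch_size=None):
        """Return a batch of not-yet-forwarded data and update the register."""
        count = len(self.not_yet_forwarded) if batch_size is None else batch_size
        batch_ids = self.not_yet_forwarded[:count]
        self.not_yet_forwarded = self.not_yet_forwarded[count:]
        self.forwarded.extend(batch_ids)
        return [(identifier, self.items[identifier]) for identifier in batch_ids]
```

Passing a batch_size forwards only some of the outstanding data in response to a query; omitting it forwards all of it, which corresponds to the "some or all" wording above.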

[0071] The data forwarded to the client system can 
be, for example, one or more of:
(a) the information item itself (or at least a textual
part thereof)

(b) metadata (e.g. text data) derived from the infor-
mation item

(c) the results of step 110 as carried out on (a) or (b)

(d) the results of step 120 as carried out on (a) or (b)

(e) a feature vector derived from (a) or (b)

[0072] At the client system, when any of (a) to (d) is
received from an index agent, the content organisation
service logic generates a feature vector and, from that,
an SOM map position, which is stored at the client sys-
tem, along with an identifier of the information item (e.g.
a URL or URI - universal resource indicator) which
identifies where the information item is stored. If (e) is
received, an SOM map position is generated and stored
at the client system along with a URL/URI.
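
Purely as a hedged sketch, the handling of received data at the client system might look as follows. The vocabulary, the function names and the dictionary layout are hypothetical. Data of kinds (a) to (d) are all treated here as text passed through a single word-count vectoriser, whereas a fuller version would omit whichever of the earlier steps had already been applied at the storage node. The closest node is found by Euclidean distance, which is also an assumption.

```python
import math
from collections import Counter

# Hypothetical dictionary of terms; the real dictionary is built earlier.
VOCABULARY = ["map", "node", "search", "video"]


def feature_vector_from_text(text, vocabulary=VOCABULARY):
    """Count occurrences of each vocabulary word in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def closest_node(vector, node_weights):
    """Return the (x, y) position of the node whose weights are nearest."""
    return min(node_weights, key=lambda pos: math.dist(vector, node_weights[pos]))


def handle_index_data(kind, payload, uri, node_weights, positions):
    """Store a map position against the identifier of the information item."""
    if kind == "e":
        vector = list(payload)                      # feature vector supplied
    else:
        vector = feature_vector_from_text(payload)  # kinds (a) to (d)
    positions[uri] = closest_node(vector, node_weights)
    return positions[uri]
```

Only the map position and the identifier are retained in this sketch, so the information item itself remains at the storage node identified by the URL or URI.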
[0073] When the user generates a query, the user 
control (input to the logic 840) is passed to the index 
service logic 860 which then distributes it to the nodes
connected to the network. They respond with data as 
described above, which is assimilated into the SOM rep- 
resentation for display to the user. 
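
The distribution of a query to the nodes might, as one simple possibility, be sketched as a sequential loop. Each storage node is modelled here as a callable returning a list of (identifier, derived data) pairs, and the assimilation step is passed in as a second callable; both conventions, and the function name, are assumptions made for the purposes of illustration.

```python
def run_query(query, storage_nodes, assimilate):
    """Send the query to every node and assimilate each returned batch.

    storage_nodes: iterable of callables, each taking the query and
        returning a list of (identifier, derived_data) pairs.
    assimilate: callable applied to each pair, for example to place the
        item on the map held at the client system.
    """
    received = 0
    for node in storage_nodes:
        for identifier, derived_data in node(query):
            assimilate(identifier, derived_data)
            received += 1
    return received
```

A networked implementation would replace each callable with a request over the data network and might issue the requests concurrently; the sequential form is used here only for clarity.
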
[0074] Instead of a storage node as described above,
the indexing service logic may receive similar data from
an Internet search engine such as Google™. This data
is handled in the same way as already described. The
transmission of the data from the search engine to the
indexing service may be initiated in any of the ways de-
scribed above.

PREFERRED FEATURES OF THE INVENTION 

[0075] Various preferred features of the invention are
also defined in the following numbered paragraphs.

1. An information retrieval system such as that de-
scribed with reference to Figure 12 in which a set
of distinct information items map to respective
nodes in an array of nodes by mutual similarity of
the information items, so that similar information
items map to nodes at similar positions in the array
of nodes; the system comprising: a graphical user
interface for displaying a representation of at least
some of the nodes as a two-dimensional display ar-
ray of display points within a display area on a user
display; a user control for defining a two-dimension-
al region of the display area; and a detector for de-
tecting those display points lying within the two-di-
mensional region of the display area; the graphical
user interface also displaying a list of data repre-
senting information items, being those information
items mapped onto nodes corresponding to display
points displayed within the two-dimensional region
of the display area.

2. A system according to paragraph 1, in which the
information items are mapped to nodes in the array
on the basis of a feature vector derived from each
information item.

3. A system according to paragraph 2, in which the 
feature vector for an information item represents a 
set of frequencies of occurrence, within that infor- 
mation item, of each of a group of information fea- 
tures. 

4. A system according to paragraph 3, in which the 
information items comprise textual information, the 
feature vector for an information item represents a 
set of frequencies of occurrence, within that infor- 
mation item, of each of a group of words. 

5. A system according to paragraph 1 or paragraph 
2, in which the information items comprise textual 
information, the nodes being mapped by mutual 
similarity of at least a part of the textual information. 

6. A system according to paragraph 4 or paragraph 
5, in which the information items are pre-processed
for mapping by excluding words occurring with 
more than a threshold frequency amongst the set 
of information items. 

7. A system according to any one of paragraphs 4 
to 6, in which the information items are pre-proc- 
essed for mapping by excluding words occurring 
with less than a threshold frequency amongst the 
set of information items. 

8. A system according to any one of paragraphs 4 
to 7, comprising: search means for carrying out a 
word-related search of the information items; the 
search means and the graphical user interface be- 
ing arranged to co-operate so that only those dis- 
play points corresponding to information items se- 
lected by the search are displayed. 

9. A system according to any one of the preceding 
paragraphs, in which the mapping between infor- 
mation items and nodes in the array includes a dith- 
er component so that substantially identical infor- 
mation items tend to map to closely spaced but dif- 
ferent nodes in the array. 

10. A system according to any one of the preceding
paragraphs, comprising a user control for choosing 
one or more information items from the list; the 
graphical user interface being operable to alter the 
manner of display within the display area of display 
points corresponding to selected information items. 

11. A system according to paragraph 10, in which 
the graphical user interface is operable to display 
in a different colour and/or intensity those display 
points corresponding to information items chosen 
within the list. 

12. An information storage system in which a set of
distinct information items are processed so as to 
map to respective nodes in an array of nodes by 
mutual similarity of the information items, such that 
similar information items map to nodes at similar po- 
sitions in the array of nodes; the system comprising: 
means for generating a feature vector derived from 
each information item, the feature vector for an in- 
formation item representing a set of frequencies of
occurrence, within that information item, of each of
a group of information features; and means for map- 
ping each feature vector to a node in the array of 
nodes, the mapping between information items and 
nodes in the array including a dither component so 
that substantially identical information items tend to 
map to closely spaced but different nodes in the ar- 
ray. 

13. A system according to paragraph 12, compris- 
ing: means for mapping a newly received informa- 
tion item to a node in the array of nodes; means for 
detecting a mapping error as the newly received in- 
formation item is so mapped; and means respon- 
sive to a detection that the mapping error exceeds 
a threshold error amount, for initiating a remapping 
process of the set of information items and the new-
ly received information item.

17. An information storage method in which a set of 
distinct information items are processed so as to 
map to respective nodes in an array of nodes by 
mutual similarity of the information items, such that 
similar information items map to nodes at similar po- 
sitions in the array of nodes; the method comprising 
the steps of: generating a feature vector derived 
from each information item, the feature vector for an in-
formation item representing a set of frequencies of
occurrence, within that information item, of each of 
a group of information features; and mapping each 
feature vector to a node in the array of nodes, the 
mapping between information items and nodes in 
the array including a dither component so that sub- 
stantially identical information items tend to map to 
closely spaced but different nodes in the array. 

18. An information retrieval method in which a set 
of distinct information items map to respective 
nodes in an array of nodes by mutual similarity of 
the information items, so that similar information
items map to nodes at similar positions in the array 
of nodes; the method comprising: displaying a rep- 
resentation of at least some of the nodes as a two- 
dimensional display array of display points within a 
display area on a user display; defining, with a user 
control, a two-dimensional region of the display ar- 
ea; detecting those display points lying within the 
two-dimensional region of the display area; and dis-
playing a list of data representing information items,
being those information items mapped onto nodes 
corresponding to display points displayed within the 
two-dimensional region of the display area. 
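
For illustration only, the detecting and listing steps of paragraphs 1 and 18 might be sketched as below. The region is assumed to be an axis-aligned rectangle with inclusive bounds, and the display points are assumed to be held as a dictionary from (x, y) coordinates to lists of item identifiers; neither assumption, nor the function name, is taken from the paragraphs themselves.

```python
def items_in_region(display_points, region):
    """List items whose display points lie inside a rectangular region.

    display_points: dict mapping (x, y) display coordinates to a list of
        identifiers of the information items mapped to that node.
    region: (x_min, y_min, x_max, y_max), with inclusive bounds.
    """
    x_min, y_min, x_max, y_max = region
    selected = []
    # Sorting by coordinate gives the list a repeatable order.
    for (x, y), items in sorted(display_points.items()):
        if x_min <= x <= x_max and y_min <= y <= y_max:
            selected.extend(items)
    return selected
```

A region of another shape, such as one drawn freehand by the user, would need only a different containment test in place of the rectangle comparison.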



Claims 

1. An information retrieval system in which a set of dis-
tinct information items map to respective nodes in
an array of nodes by mutual similarity of the infor-
mation items, so that similar information items map
to nodes at similar positions in the array of nodes;
the system comprising:

a data network;

an information retrieval client system connect-
ed to the data network; and

one or more information item storage nodes
connected to the data network;

in which:

each storage node comprises means for stor-
ing a plurality of information items and indexing
means for transmitting data derived from infor-
mation items stored at that storage node to the
client system via the data network; and

the client system comprises means, responsive
to data received from the indexing means of a
storage node, for generating a node position in
respect of each information item represented
by the received data.

2. A system according to claim 1, in which the indexing
means at each storage node is operable to transmit
data to the client system in batches; each batch
comprising at least data derived from some of those
information items stored at that storage node for
which data has not previously been transmitted to
the client system.

3. A system according to claim 2, in which each batch
of data comprises data derived from those informa-
tion items stored at that storage node for which data
has not previously been transmitted to the client
system.


4. A system according to any one of claims 1 to 3, in
which the indexing means at each storage node is
operable to transmit to the client system a batch of
data derived from information items stored at that
storage node in response to an information retrieval
operation at the client system.

5. A system according to any one of claims 1 to 3, in
which the indexing means at each storage node is
operable to detect an information item which is
modified or newly stored at that storage node and,
in response to such a detection, to send a batch of
data derived from that information item to the client
system.


6. A system according to any one of the preceding
claims, in which the data network is an internet net- 
work. 

7. A system according to claim 6, in which one or more
of the storage nodes are internet search servers. 

8. A system according to any one of the preceding
claims, in which:

the information items are at least partially tex-
tual; and

the data derived from a stored information item
comprises the whole of the textual content of
that information item.

9. A system according to any one of claims 1 to 7, in
which the data derived from a stored information
item comprises textual data indicative of the content
of the stored information item.

10. A system according to any one of the preceding 
claims, in which the client system comprises a
graphical user interface for displaying a represen- 
tation of at least some of the nodes as a two-dimen- 
sional display array of display points within a display 
area on a user display. 


11. A system according to claim 10, in which the client 
system comprises: 

a user control for defining a two-dimensional re-
gion of the display area; and
a detector for detecting those display points ly- 
ing within the two-dimensional region of the dis- 
play area. 

12. A system according to claim 11, in which the graph-
ical user interface is operable to display a list of data
representing information items, being those infor-
mation items mapped onto nodes corresponding to
display points displayed within the two-dimensional
region of the display area.

13. A system according to claim 12, in which the client 
system comprises a user control for choosing one 
or more information items from the list; the graphical 
user interface being operable to alter the manner of
display within the display area of display points cor- 
responding to selected information items. 

14. A system according to any one of the preceding 
claims, in which the data derived from an informa-
tion item includes an identification of the storage lo-
cation of that information item.

15. A system according to claim 14, in which the iden-
tification comprises a universal resource indicator
(URI). 

16. An information storage node for use in an informa-
tion retrieval system in which a set of distinct infor-
mation items map to respective nodes in an array
of nodes by mutual similarity of the information 
items, so that similar information items map to 
nodes at similar positions in the array of nodes; the
storage node being connected via a data network
to an information retrieval client system having 
means, responsive to data received from the stor- 
age node, for generating a node position in respect 
of each information item represented by the re- 
ceived data; the storage node comprising: 

means for storing a plurality of information 
items and indexing means for transmitting data 
derived from information items stored at that 
storage node to the client system via the data 
network. 

17. An information retrieval client system in which a set 
of distinct information items map to respective 
nodes in an array of nodes by mutual similarity of 
the information items, so that similar information
items map to nodes at similar positions in the array 
of nodes; the client system being connectable via a 
data network to one or more information item stor- 
age nodes each comprising means for storing a plu- 
rality of information items and indexing means for 
transmitting data derived from information items 
stored at that storage node to the client system via 
the data network; 

the client system comprising means, respon- 
sive to data received from the indexing means of a 
storage node, for generating a node position in re- 
spect of each information item represented by the 
received data. 

18. A portable data processing device comprising a cli-
ent system according to claim 17. 

19. Video acquisition and/or processing apparatus 
comprising a client system according to claim 17. 

20. An information retrieval method in which a set of dis- 
tinct information items map to respective nodes in 
an array of nodes by mutual similarity of the infor- 
mation items, so that similar information items map 
to nodes at similar positions in the array of nodes 
in a system comprising a data network, an informa- 
tion retrieval client system connected to the data 
network, and one or more information item storage
nodes connected to the data network; 

the method comprising the steps of: 

each storage node storing a plurality of infor- 
mation items; 

each storage node transmitting data derived 
from information items stored at that storage 
node to the client system via the data network;
and 

the client system, responsive to data received
from the indexing means of a storage node,
generating a node position in respect of each
information item represented by the received
data.

EP 1 400 903 A1



21. A method of operation of an information storage
node for use in an information retrieval system in
which a set of distinct information items map to re-
spective nodes in an array of nodes by mutual sim-
ilarity of the information items, so that similar infor-
mation items map to nodes at similar positions in
the array of nodes; the storage node being connect-
able via a data network to an information retrieval
client system having means, responsive to data re-
ceived from the storage node, for generating a node
position in respect of each information item repre-
sented by the received data; the method comprising
the steps of:

storing a plurality of information items; and
transmitting data derived from information
items stored at that storage node to the client
system via the data network.

22. A method of operation of an information retrieval cli-
ent system in which a set of distinct information
items map to respective nodes in an array of nodes
by mutual similarity of the information items, so that
similar information items map to nodes at similar po-
sitions in the array of nodes; the client system being
connectable via a data network to one or more in-
formation item storage nodes each comprising
means for storing a plurality of information items
and indexing means for transmitting data derived
from information items stored at that storage node
to the client system via the data network;

the method comprising, responsive to data re-
ceived from the indexing means of a storage node,
generating a node position in respect of each infor-
mation item represented by the received data.

23. Computer software comprising program code for 
carrying out a method according to any one of
claims 20 to 22. 

24. A providing medium for providing software accord- 
ing to claim 23. 


25. A medium according to claim 24, the medium being 
a storage medium. 

26. A medium according to claim 24, the medium being
a transmission medium.